ABSTRACT

As the number of people who use scientific literature databases
grows, the demand for literature retrieval services has steadily
increased. One of the most popular retrieval services is to find a
set of papers similar to the paper under consideration, which re- 
quires a measure that computes similarities between papers. Scien- 
tific literature databases exhibit two interesting characteristics that 
are different from general databases. First, the papers cited by old 
papers are often not included in the database due to technical and 
economic reasons. Second, since a paper references the papers pub- 
lished before it, few papers cite recently-published papers. These 
two characteristics cause all existing similarity measures to fail in 
at least one of the following cases: (1) measuring the similarity 
between old, but similar papers, (2) measuring the similarity be- 
tween recent, but similar papers, and (3) measuring the similarity 
between two similar papers: one old, the other recent. In this paper, 
we propose a new link-based similarity measure called C-Rank, 
which uses both in-link and out-link by disregarding the direction 
of references. In addition, we discuss the most suitable normaliza- 
tion method for scientific literature databases and propose an eval- 
uation method for measuring the accuracy of similarity measures. 
We have used a database with real-world papers from DBLP and 
their reference information crawled from Libra for experiments and 
compared the performance of C-Rank with those of existing simi- 
larity measures. Experimental results show that C-Rank achieves a 
higher accuracy than existing similarity measures. 

Categories and Subject Descriptors: I.5.3 [Clustering] Similarity
measures

General Terms: Measurement, Reliability 

Keywords: Scientific Literature, Link-based Similarity Measure 

1. INTRODUCTION 

As the number of people who use scientific literature databases
grows, the demand for scientific literature retrieval services has
steadily increased. One of the most popular retrieval services
is to find a set of papers similar to the paper under consideration,
which requires a measure that computes similarities between papers.
Various similarity measures, based on either keywords or references,
have been proposed in the field of information retrieval [1].
Text-based similarity measures count the number of keywords
in common between two papers. Link-based similarity measures
transform the reference information in a paper into directed links
and compute the similarity score between papers using graph-based
methods [2][3].

Intuitively, two scientific papers are considered similar when the
research problems addressed in those papers are similar. Text-based
similarity measures are not suitable in this regard, since they may
conclude that two papers are similar as long as the context is
similar, even when the problems the papers tackle are different [1].
Link-based measures, on the other hand, use the references created by
the authors to the papers that solve similar problems. Therefore,
similarity measures based on the reference information tend to be more
consistent with people's view of which papers are similar [4][5].
In this paper, we propose a new link-based similarity measure for
scientific literature databases.

There have been many link-based similarity measures in the
literature [2][3][5][7][8][9][10][11][12][13]. Typical link-based
similarity measures include Bibliographic Coupling (Coupling) [2],
Co-citation [3], Amsler [7], rvs-SimRank [5], SimRank [8], and
P-Rank [5]. In Co-citation, the similarity between two objects is
computed based on the number of objects that reference both
objects (i.e., in-links). The more objects that reference both objects,
the higher the similarity score of the two objects [3]. In Coupling,
the similarity between two objects is computed based on the number of
objects which are referenced by both of them (i.e., out-links). The
more objects that are referenced by both objects, the higher the
similarity score of the two objects [3]. Amsler measures the similarity
between two objects as a weighted sum of the similarity scores by
Coupling and by Co-citation [7]. SimRank improves the accuracy
of Co-citation by computing the similarity score iteratively. The
iterative computation of similarity captures the recursive intuition
that two objects are similar if they are referenced by similar objects
[8]. Rvs-SimRank and P-Rank improve Coupling and Amsler,
respectively, in a similar way [5].

Scientific literature databases exhibit two unique characteristics 
that do not exist in general databases. First, few papers exist which 
are referenced by old papers. This is because very old papers are 
often not included in the database due to technical and economic 
reasons. Second, since a paper can reference only the papers pub- 
lished before it (and never the papers published after it), there exist 
few papers which reference recently-published papers. 

These two characteristics in a scientific literature database cause
all existing link-based similarity measures to fail in at least one of
the following three cases: (1) measuring the similarity between old
papers, (2) measuring the similarity between recent papers, and (3)
measuring the similarity between an old paper and a recent one.

First, Coupling, which uses out-link, may compute the similarity 
score between two old but similar papers as near 0, because there 
exist few papers that are referenced by both of them in the database. 
Second, Co-citation, which uses in-link, on the other hand, may 
compute the score between two recent but similar papers as near 
0, because there exist few papers which reference both papers in 
the database. Third, both Coupling and Co-citation may compute 
the score between two similar papers, one old and the other recent, 
as near 0, because the old paper tends to have few papers that are 
referenced by it and the recent one tends to have few papers that 
reference it. Other similarity measures are plagued with similar 
problems, which are discussed in detail in Section 2. 

Two papers p and q should be determined similar in the follow- 
ing three cases. First, p and q are similar if the number of papers 
referenced by both p and q (out-links) is high. Second, p and q 
are similar if the number of papers which reference both p and q 
(in-links) is high. Third, p and q are similar if many of the papers 
that are referenced by p reference q. Although the first and the
second cases are captured by Coupling and Co-citation, respectively,
neither measure addresses both cases simultaneously. Moreover, no
existing measure can be used for the third case.

To compute the similarity score correctly regardless of the
publication dates of papers, one should consider all three cases
simultaneously. In other words, one should employ all three measures:
Coupling for computing the similarity between recent papers,
Co-citation for computing the similarity between old papers, and a new
measure for computing the similarity between an old and a recent
paper. This can be achieved by transforming both out-links and
in-links into undirected links and computing the similarity based
on the number of papers that connect the two papers. In this paper,
we propose C-Rank, a new similarity measure that computes the
similarity properly in all three cases.

Existing similarity measures use various normalization methods
to prevent the similarity score between two papers from increasing
as the number of links to and from the papers increases [8][11][14].
Typical normalization methods include the Jaccard coefficient, used in
Coupling, Co-citation, and Amsler, and the pairwise method, used
in rvs-SimRank, SimRank, and P-Rank. In this paper, we show through
experiments that the Jaccard coefficient is more suitable than the
pairwise method for scientific literature databases.

The ideal similarity measure should match the intuition of users,
and the best way to evaluate similarity measures is to employ
humans [8]. In this paper, we point out the problems with the
evaluation methods used in previous studies and propose a new method
that solves those problems. We use the proposed evaluation method
in our experiments.

The rest of the paper is organized as follows. Section 2 points out the
problems with existing similarity measures when applied to sci- 
entific literature databases. Section 3 describes C-Rank, the de- 
tailed algorithm, and the suitable normalization method. Section 4 
compares the accuracy of C-Rank with those of existing measures 
through experiments. Section 5 summarizes and concludes the pa- 
per. 

2. RELATED WORK 

In this section, we examine existing link-based similarity mea- 
sures and discuss why they fail to measure similarity correctly when 
used for scientific literature databases. 

2.1 Link-Based Similarity Measures 

Existing link-based similarity measures include Co-citation,
Coupling, Amsler, SimRank, rvs-SimRank, and P-Rank [5]. Co-citation,
Coupling, and Amsler were proposed for measuring similarity among
scientific papers [5], and were applied to different types of objects
with link information [15][16][17]. SimRank, rvs-SimRank, and
P-Rank, on the other hand, were originally proposed for general
objects with link information [5][8].

In Co-citation, the similarity between two objects is computed
based on the number of objects that have in-links to both objects.
Equation 1 represents Co-citation, where p and q denote objects,
$S(p, q)$ the similarity score between p and q, and $I(p)$ the set of
in-link neighbors of p.

$$S(p, q) = |I(p) \cap I(q)| \qquad (1)$$

In Coupling, the similarity between two objects is computed
based on the number of objects that have out-links from both
objects. Equation 2 represents Coupling, where $O(p)$ denotes the set
of out-link neighbors of p.

$$S(p, q) = |O(p) \cap O(q)| \qquad (2)$$

Amsler measures the similarity between two objects as a weighted
sum of the similarity scores by Coupling and by Co-citation.
Equation 3 represents Amsler. The relative weight of the similarity
score of Co-citation and that of Coupling is balanced by the
parameter λ. In most applications, λ is set to 0.5 [3].

$$S(p, q) = \lambda \times |I(p) \cap I(q)| + (1 - \lambda) \times |O(p) \cap O(q)| \qquad (3)$$

Figure 1 shows an example of a reference graph, where a to j
represent papers and arrows represent reference relations between
papers. The similarity score between e and f by Co-citation is 1,
because there is one paper, i, that references both papers. The score
between e and f by Coupling is 1, because a single paper, b, is
referenced by both. The score between e and f by Amsler is 1, assuming
the relative weight for Coupling and Co-citation is 0.5.
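
The three counting measures in Equations 1 to 3 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper names (`build_neighbors`, `co_citation`, `coupling`, `amsler`) are hypothetical, and the edge list contains only the four references around e and f that the text mentions, not the full graph of Figure 1.

```python
from collections import defaultdict


def build_neighbors(edges):
    """Build in-link and out-link neighbor sets from (citing, cited) pairs."""
    in_nbrs, out_nbrs = defaultdict(set), defaultdict(set)
    for citing, cited in edges:
        out_nbrs[citing].add(cited)
        in_nbrs[cited].add(citing)
    return in_nbrs, out_nbrs


def co_citation(in_nbrs, p, q):
    """Equation 1: number of papers that reference both p and q."""
    return len(in_nbrs[p] & in_nbrs[q])


def coupling(out_nbrs, p, q):
    """Equation 2: number of papers referenced by both p and q."""
    return len(out_nbrs[p] & out_nbrs[q])


def amsler(in_nbrs, out_nbrs, p, q, lam=0.5):
    """Equation 3: weighted sum of Co-citation and Coupling."""
    return lam * co_citation(in_nbrs, p, q) + (1 - lam) * coupling(out_nbrs, p, q)


# Assumed fragment of Figure 1: i references e and f; e and f reference b.
edges = [("i", "e"), ("i", "f"), ("e", "b"), ("f", "b")]
in_nbrs, out_nbrs = build_neighbors(edges)

print(co_citation(in_nbrs, "e", "f"))       # 1
print(coupling(out_nbrs, "e", "f"))         # 1
print(amsler(in_nbrs, out_nbrs, "e", "f"))  # 1.0
```

With λ = 0.5 the Amsler score is the average of the two counts, which is why it equals 1 only when both Co-citation and Coupling find a common neighbor.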




Figure 1: A reference graph. 



On the other hand, Co-citation computes the score between a
and c as 0 and the score between d and g as 1. A closer look
reveals that d references a and that g references c. Since the papers
that reference them (d and g) have a similarity score of 1, a and c
may be regarded as somewhat similar. SimRank captures this
intuition that objects referenced by similar objects are similar.
That is, SimRank computes the similarity score recursively.
Equation 4 represents SimRank. In Equation 4, $R_k(p, q)$ denotes the
similarity score between p and q at iteration k, and $I_i(p)$ denotes
the paper connected to p through the i-th in-link. C is a decay factor
for attenuating the similarity score during similarity propagation,
where $C \in [0, 1]$.



$$R_0(p, q) = \begin{cases} 0 & \text{if } p \neq q \\ 1 & \text{if } p = q \end{cases}$$

$$R_{k+1}(p, q) = \frac{C}{|I(p)||I(q)|} \sum_{i=1}^{|I(p)|} \sum_{j=1}^{|I(q)|} R_k(I_i(p), I_j(q)) \qquad (4)$$

By using globalized neighbors, SimRank improves the accuracy
of Co-citation, which uses localized neighbors only. Similarly,
rvs-SimRank and P-Rank improve Coupling and Amsler, respectively.
Equation 5 represents rvs-SimRank. The only difference between
rvs-SimRank and SimRank is the type of links used. Equation 6
represents P-Rank. As shown in Equation 6, P-Rank measures the
similarity score between two objects as a weighted sum of the
similarity scores by rvs-SimRank and SimRank. In both equations,
$R_0(p, q)$ is defined as in Equation 4.

$$R_{k+1}(p, q) = \frac{C}{|O(p)||O(q)|} \sum_{i=1}^{|O(p)|} \sum_{j=1}^{|O(q)|} R_k(O_i(p), O_j(q)) \qquad (5)$$

$$R_{k+1}(p, q) = \lambda \times \frac{C}{|I(p)||I(q)|} \sum_{i=1}^{|I(p)|} \sum_{j=1}^{|I(q)|} R_k(I_i(p), I_j(q)) + (1 - \lambda) \times \frac{C}{|O(p)||O(q)|} \sum_{i=1}^{|O(p)|} \sum_{j=1}^{|O(q)|} R_k(O_i(p), O_j(q)) \qquad (6)$$

Table 1 summarizes the existing similarity measures [5]. When
k = 1, C = 1, and λ = 1 (or λ = 0), Equation 6 represents
Co-citation (or Coupling). When k = 1, C = 1, and λ = 0.5,
Equation 6 represents Amsler. When k → ∞, Equation 6 represents
SimRank, rvs-SimRank, or P-Rank, depending on the value of λ.
Even though k → ∞ for SimRank, rvs-SimRank, and P-Rank,
empirically the similarity scores by SimRank, rvs-SimRank,
and P-Rank tend to converge at k = 4 or 5 [5][8].

Table 1: Relationship among similarity measures (adapted from [5])

k \ Links used   In-link              Out-link             Both
k = 1            Co-citation          Coupling             Amsler
                 (C = 1, λ = 1)       (C = 1, λ = 0)       (C = 1, λ = 1/2)
k = ∞            SimRank              rvs-SimRank          P-Rank
                 (C varies, λ = 1)    (C varies, λ = 0)    (C and λ vary)

2.2 Problems with Existing Similarity Measures

Scientific literature databases have two characteristics that are
different from general databases. First, very old papers are often
not in the database. Second, there exist few papers that reference
recently-published papers. Due to these two characteristics, all
existing similarity measures fail to compute the similarity score
correctly in scientific literature databases in at least one of the
following three cases.

(P1) measuring the similarity between old, but similar papers
(P2) measuring the similarity between recent, but similar papers
(P3) measuring the similarity between two similar papers:
one old, the other recent

Figure 2 represents the reference relations among papers as a
graph. In Figure 2, a to l represent papers, and arrows represent the
reference relations between papers. The papers at the top of the
figure are older, and the papers at the bottom are more recent. An
example of (P1) occurs when the similarity score between a and b is
computed. The similarity score computed by Coupling (rvs-SimRank)
is 0 (near 0) because these papers have no out-links. The similarity
score by Amsler (P-Rank) is not 0, because the score by Co-citation
is 1. The score by Amsler (P-Rank), however, would be at most 0.5
(assuming the relative weight for Coupling and Co-citation is 0.5).
That is, the score by Amsler (P-Rank) is inaccurate. An example of
(P2) occurs when the score between k and l is computed. The score
computed by Co-citation (SimRank) is 0 (near 0) because these papers
have no in-links. The score by Amsler (P-Rank) would be only 0.5
(near 0.5), even though they have a common out-link neighbor i. An
example of (P3) occurs when the score between e and l is computed.
The score computed by all existing similarity measures is 0 or near 0.

Figure 2: A graph of the reference relationships with publishing
dates.

Coupling, Co-citation, and Amsler fail to capture the similarity
between papers in scientific literature databases. Rvs-SimRank,
SimRank, and P-Rank are plagued with the same problems, since
they are the iterative extensions of Coupling, Co-citation, and
Amsler, respectively.
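
The failure cases (P1) to (P3) can be reproduced with a small sketch of the iterative measures. The code below is an illustration under stated assumptions, not the authors' code: `p_rank` is a hypothetical helper that follows the weighted form of Equation 6 (λ = 1 gives SimRank, λ = 0 gives rvs-SimRank), it keeps the score of a paper with itself at 1 (the usual SimRank convention, assumed here), and the six-paper graph is invented for the example rather than taken from Figure 2.

```python
from itertools import product


def p_rank(nodes, edges, C=0.8, lam=0.5, iterations=5):
    """Iterative P-Rank sketch; lam=1 is SimRank, lam=0 is rvs-SimRank."""
    in_n = {v: [] for v in nodes}
    out_n = {v: [] for v in nodes}
    for citing, cited in edges:
        out_n[citing].append(cited)
        in_n[cited].append(citing)

    def avg(scores, xs, ys):
        # Pairwise normalization: sum over all neighbor pairs / (|xs| * |ys|).
        if not xs or not ys:
            return 0.0
        return sum(scores[x, y] for x in xs for y in ys) / (len(xs) * len(ys))

    R = {(p, q): 1.0 if p == q else 0.0 for p, q in product(nodes, repeat=2)}
    for _ in range(iterations):
        nxt = {}
        for p, q in R:
            if p == q:
                nxt[p, q] = 1.0  # assumed convention: self-similarity stays 1
            else:
                nxt[p, q] = C * (lam * avg(R, in_n[p], in_n[q])
                                 + (1 - lam) * avg(R, out_n[p], out_n[q]))
        R = nxt
    return R


# Hypothetical graph: a, b are old (no out-links in the database),
# k, l are recent (no in-links), c, d lie in between.
nodes = ["a", "b", "c", "d", "k", "l"]
edges = [("c", "a"), ("c", "b"), ("d", "a"), ("d", "b"),
         ("k", "c"), ("k", "d"), ("l", "c"), ("l", "d")]

simrank = p_rank(nodes, edges, lam=1.0)
rvs_simrank = p_rank(nodes, edges, lam=0.0)
prank = p_rank(nodes, edges, lam=0.5)

print(rvs_simrank["a", "b"])  # 0.0  (P1: old pair, no out-links)
print(simrank["k", "l"])      # 0.0  (P2: recent pair, no in-links)
print(prank["a", "k"])        # 0.0  (P3: old and recent pair)
```

In this toy graph, a and k are connected only through c and d, which a is cited by and k cites, so neither the in-link term nor the out-link term of Equation 6 can ever become positive for that pair.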



3. PROPOSED SIMILARITY MEASURE 

In this section, we propose a new similarity measure called C- 
Rank and describe its algorithm in detail. We also discuss a nor- 
malization method appropriate for the new measure. 



3.1 Main Idea 

Two papers p and q should receive a high similarity score in the 
following three cases. 



(C1) the number of papers referenced by both p and q is high
(C2) the number of papers which reference both p and q is
high
(C3) the number of papers which are referenced by p
and reference q is high



We define a paper which is referenced by both papers as an OP
(common Out-link Paper), a paper which references both papers as
an IP (common In-link Paper), and a paper which is referenced by
one paper and references the other as a BP (common Between Paper).
In Figure 2, for example, f is an OP of g and h, h is an IP of d and
f, and c is a BP of a and f.

The existing measures can be used in cases (C1) and (C2).
Coupling or rvs-SimRank can be used for (C1), and Co-citation or
SimRank can be used for (C2). In Figure 2, for example, Coupling
(rvs-SimRank) can be used to measure the similarity between
g and h, and Co-citation (SimRank) can be used to measure the
similarity between d and f. The existing measures, however, cannot
correctly measure the similarity in (C3). In Figure 2, for example,
existing measures fail to compute the similarity between a and
f. A similarity measure that counts BPs should be suitable for this
case. Of course, a BP-based similarity measure cannot be used for
papers with publication dates close to each other, such as g and
h, since there exist few BPs between the papers under
consideration.
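
The three kinds of common papers can be computed directly from the reference list. The following sketch is illustrative only: `common_papers` is a hypothetical helper, and the edge list contains just the references implied by the examples in the text (g and h reference f, h references d, f references c, c references a), not the complete graph of Figure 2.

```python
def common_papers(edges, p, q):
    """Return the OPs, IPs and BPs of papers p and q.

    edges is a list of (citing, cited) pairs.
    """
    out_n, in_n = {}, {}
    for citing, cited in edges:
        out_n.setdefault(citing, set()).add(cited)
        in_n.setdefault(cited, set()).add(citing)
    out_p, out_q = out_n.get(p, set()), out_n.get(q, set())
    in_p, in_q = in_n.get(p, set()), in_n.get(q, set())
    return {
        "OP": out_p & out_q,                    # referenced by both
        "IP": in_p & in_q,                      # reference both
        "BP": (out_p & in_q) | (in_p & out_q),  # between p and q
    }


# Assumed fragment of Figure 2.
edges = [("g", "f"), ("h", "f"), ("h", "d"), ("f", "c"), ("c", "a")]

print(common_papers(edges, "g", "h")["OP"])  # {'f'}
print(common_papers(edges, "d", "f")["IP"])  # {'h'}
print(common_papers(edges, "a", "f")["BP"])  # {'c'}
```

Each pair in this fragment has exactly one kind of common paper, which is the situation in which a measure that looks at only one kind of link returns a score of 0.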

To compute the score correctly in all three cases, therefore, we
propose to use all three measures: Co-citation (or SimRank),
Coupling (or rvs-SimRank), and a new BP-based measure. To combine
the three measures, a weighted sum of the similarity scores from
the three measures could be used, similar to Amsler (or
P-Rank). However, this would suffer from the same problem as
Amsler (or P-Rank): one of the scores may be near 0, which
results in a score that is much lower than the correct value. Instead
of using a weighted sum, therefore, we propose a new measure that
considers the three cases simultaneously.

3.2 C-Rank 

Though papers are classified into OPs, IPs, and BPs based on the 
direction of links, their role is the same: they are used to compute 
the similarity between two papers. So, we disregard the direction of 
links, which results in a single type of links that connect two papers. 
We define the papers which connect two papers as Connectors. 
When the direction of references is disregarded, Coupling (or
rvs-SimRank), Co-citation (or SimRank), and a BP-based similarity
measure are unified as a single measure that computes the similarity
score based on the number of Connectors in an undirected graph.

We propose a Connector-based similarity measure called C-Rank.
C-Rank uses both in-links and out-links at the same time. Equation
7 represents C-Rank, where $L(p)$ denotes the set of undirected link
neighbors of paper p and $L_i(p)$ denotes the paper connected to p
through the i-th undirected link. Just as the accuracy of Co-citation
(Coupling) is improved by the iterative SimRank (rvs-SimRank),
C-Rank is defined iteratively.

$$R_0(p, q) = \begin{cases} 0 & \text{if } p \neq q \\ 1 & \text{if } p = q \end{cases}$$

$$R_{k+1}(p, q) = \frac{C}{|L(p)||L(q)|} \sum_{i=1}^{|L(p)|} \sum_{j=1}^{|L(q)|} R_k(L_i(p), L_j(q)) \qquad (7)$$
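
A minimal sketch of Equation 7 follows. It is an illustration rather than the authors' implementation: `c_rank_pairwise` is a hypothetical name, the score of a paper with itself is held at 1 (an assumed convention borrowed from SimRank), and the small graph is invented so that a is an old paper, k is a recent one, and c and d lie between them.

```python
from itertools import product


def c_rank_pairwise(nodes, edges, C=0.8, iterations=5):
    """C-Rank with pairwise normalization over undirected links."""
    L = {v: set() for v in nodes}
    for citing, cited in edges:
        # Disregard direction: each reference becomes an undirected link.
        L[citing].add(cited)
        L[cited].add(citing)

    R = {(p, q): 1.0 if p == q else 0.0 for p, q in product(nodes, repeat=2)}
    for _ in range(iterations):
        nxt = {}
        for p, q in R:
            if p == q:
                nxt[p, q] = 1.0  # assumed convention: self-similarity stays 1
            elif not L[p] or not L[q]:
                nxt[p, q] = 0.0  # no links, nothing to propagate
            else:
                total = sum(R[x, y] for x in L[p] for y in L[q])
                nxt[p, q] = C * total / (len(L[p]) * len(L[q]))
        R = nxt
    return R


# Hypothetical graph: a, b old; c, d in between; k, l recent.
nodes = ["a", "b", "c", "d", "k", "l"]
edges = [("c", "a"), ("c", "b"), ("d", "a"), ("d", "b"),
         ("k", "c"), ("k", "d"), ("l", "c"), ("l", "d")]

scores = c_rank_pairwise(nodes, edges)
print(scores["a", "b"] > 0)  # True: old pair
print(scores["k", "l"] > 0)  # True: recent pair
print(scores["a", "k"] > 0)  # True: old and recent pair
```

Because c and d are Connectors of every one of these pairs once direction is ignored, all three pairs receive a positive score without any weight parameter.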

Unlike Amsler or P-Rank, C-Rank does not need the parameter λ,
because C-Rank unifies in-links and out-links into undirected links.
Furthermore, C-Rank has an effect similar to increasing the weight
of Co-citation (SimRank) when computing the score between old
papers, increasing the weight of Coupling (rvs-SimRank) when
computing the score between recent papers, and increasing the weight
of a BP-based similarity measure when computing the score
between an old and a recent paper. The user does not have to set the
value of λ when using C-Rank. In experiments, we show that the
accuracy of C-Rank is higher than that of Amsler (P-Rank) with
different λ values.

One of the evaluation criteria for link-based similarity measures
is how many pairs of objects can be measured [5]. SimRank fails
to compute the similarity when a paper has no in-links, and
rvs-SimRank fails when a paper has no out-links. Although P-Rank can
compute similarity scores for more pairs than the other existing
measures, it measures similarity for fewer pairs than
C-Rank. This is because P-Rank fails to compute the similarity
between an old paper and a recent one. In experiments, we show that
the number of pairs of papers for which C-Rank computes a score is
larger than that of any other measure.

Treating both in-links and out-links as undirected might be thought
to lose the semantics of the direction of links. By disregarding
the direction of links, however, C-Rank is able to handle all
three cases mentioned in Section 2.2. Thus, the measure has more
advantages than disadvantages when computing the similarities among
papers.

3.3 Normalization 

In previous studies, two types of normalization methods have been
used to prevent the similarity score between two papers from
increasing simply because the number of links increases. The Jaccard
coefficient, used in Coupling (or Co-citation), normalizes the
similarity score by dividing the number of papers which are
referenced by (or reference) both papers by the number of distinct
papers which are referenced by (or reference) at least one of
them [14]. Rvs-SimRank, SimRank, and P-Rank use the pairwise
normalization method. SimRank, for example, builds the set of pairs
between the papers that reference one of the two papers under
consideration and the papers that reference the other, computes
the sum of the similarity scores of all pairs, and divides it by the
product of the numbers of in-links to the two papers.
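
As a worked illustration of the two normalization methods, consider the special case (assumed here for simplicity) in which p and q are referenced by exactly the same k papers, so that $I(p) = I(q)$ and $|I(p)| = k$. One pairwise-normalized step starting from $R_0$ and the Jaccard coefficient then give:

```latex
% Pairwise normalization: of the k^2 neighbor pairs, only the k pairs with
% I_i(p) = I_j(q) have R_0 = 1; all other pairs have R_0 = 0.
R_1(p, q) = \frac{C}{|I(p)||I(q)|} \sum_{i=1}^{k} \sum_{j=1}^{k} R_0(I_i(p), I_j(q))
          = \frac{C}{k^2} \cdot k = \frac{C}{k}

% Jaccard coefficient: the intersection and the union both contain k papers.
\frac{|I(p) \cap I(q)|}{|I(p) \cup I(q)|} = \frac{k}{k} = 1
```

With C = 1 the pairwise score is 1/k, so it shrinks as the two papers attract more common citations, whereas the Jaccard score stays at 1.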

In scientific literature databases, some well-known papers are
referenced by many other papers, and people who use retrieval
services would be interested in those quality papers. Since the
pairwise normalization method lowers the similarity score of
papers with many in-links, the similarity score between two famous
papers can be very low [11]. Figure 3 shows an example of
the problem with the pairwise normalization method. In Figure 3,
papers p and q are referenced by all the other papers and should
be judged similar. When the number of papers which reference
both p and q is k, however, the similarity score with pairwise
normalization becomes 1/k. The same problem exists when the
similarity is computed iteratively, although the score may be somewhat
higher than 1/k [11]. So, for scientific literature databases, where
famous papers (in which users would be interested) have many
in-links, the Jaccard coefficient seems a better normalization method.
Equation 8 represents C-Rank with the Jaccard coefficient. In Equation
8, '\' denotes set difference. In experiments, we show that the
Jaccard coefficient is more suitable than pairwise normalization for
scientific literature databases.

$$R_0(p, q) = \begin{cases} 0 & \text{if } p \neq q \\ 1 & \text{if } p = q \end{cases}$$

$$R_{k+1}(p, q) = \frac{|L(p) \cap L(q)|}{|L(p) \cup L(q)|} + C \times \left( \frac{1}{|L(p) \cup L(q)||L(q)|} \sum_{p' \in L(p) \setminus L(q)} \sum_{q' \in L(q)} R_k(p', q') + \frac{1}{|L(p) \cup L(q)||L(p)|} \sum_{p' \in L(p)} \sum_{q' \in L(q) \setminus L(p)} R_k(p', q') \right) \qquad (8)$$

Figure 3: An example showing the problem with the pairwise
normalization method.

3.4 Recursive C-Rank

The recursive C-Rank in Equation (8) has the following four
properties. For any papers p and q, the iterative C-Rank of p and q
is the same as that of q and p (symmetry). The iterative C-Rank is
non-decreasing during similarity computation (monotonicity).
Existence and uniqueness guarantee that there exists a unique solution
to iterative C-Rank, which reaches a fixed point by iterative
computation. The proofs can be found in the Appendix.

(Symmetry)      $R_k(p, q) = R_k(q, p)$
(Monotonicity)  $0 \leq R_k(p, q) \leq R_{k+1}(p, q) \leq 1$
(Existence)     The solution to the iterative C-Rank equations
                always exists and converges to a fixed point,
                s(*, *), which is the theoretical solution to the
                recursive C-Rank equations.
(Uniqueness)    The solution to the iterative C-Rank equation is
                unique when $C \neq 1$.



3.5 Algorithm 

Table 2 shows the algorithm of C-Rank. For every pair of papers
(p, q), an entry R(p, q) maintains the intermediate C-Rank score
of (p, q) during iterative computation. Because the k-th iterative
C-Rank score is computed based on the C-Rank scores of the (k − 1)-th
iteration, an auxiliary similarity score store R*(p, q) is maintained
accordingly. The code in Table 2 first initializes $R_0(p, q)$
(Lines 1~4). During iterative computation, R*(*, *) is updated from
the scores R(*, *) of the (k − 1)-th iteration (Lines 6~17). Then
$R_k(*, *)$ is replaced by $R_{k+1}(*, *)$ for further iteration (Lines
18~20). This iterative procedure is repeated k times (Lines 5~21).
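
The procedure described above can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the function name `c_rank_jaccard` and the toy graph are hypothetical, and the update is assumed to be the Jaccard term over common neighbors plus C times the two normalized sums over non-common neighbors. The accumulators are reset for every pair, and empty neighbor sets are guarded against division by zero; both are choices made for this sketch.

```python
from itertools import product


def c_rank_jaccard(nodes, edges, C=0.8, iterations=5):
    """C-Rank sketch with Jaccard-style normalization over undirected links."""
    L = {v: set() for v in nodes}
    for citing, cited in edges:
        L[citing].add(cited)
        L[cited].add(citing)

    # Initialization: 1 on the diagonal, 0 elsewhere.
    R = {(p, q): 1.0 if p == q else 0.0 for p, q in product(nodes, repeat=2)}

    for _ in range(iterations):
        R_next = {}  # auxiliary store, so each pass reads the previous scores
        for p, q in R:
            union = L[p] | L[q]
            if p == q:
                R_next[p, q] = 1.0
                continue
            if not union:
                R_next[p, q] = 0.0
                continue
            score = len(L[p] & L[q]) / len(union)
            # Neighbors only of p, paired with all neighbors of q.
            diff_p = sum(R[lp, lq] for lp in L[p] - L[q] for lq in L[q])
            if L[q]:
                diff_p /= len(union) * len(L[q])
            # Neighbors only of q, paired with all neighbors of p.
            diff_q = sum(R[lp, lq] for lp in L[q] - L[p] for lq in L[p])
            if L[p]:
                diff_q /= len(union) * len(L[p])
            R_next[p, q] = score + C * (diff_p + diff_q)
        R = R_next
    return R


# Hypothetical graph: a, b old; c, d in between; k, l recent.
nodes = ["a", "b", "c", "d", "k", "l"]
edges = [("c", "a"), ("c", "b"), ("d", "a"), ("d", "b"),
         ("k", "c"), ("k", "d"), ("l", "c"), ("l", "d"), ("l", "b")]

scores = c_rank_jaccard(nodes, edges)
print(scores["a", "k"])  # 1.0: a and k have the same undirected neighbors
print(scores["k", "l"])
```

In this graph k and l share two of three distinct neighbors, so their score starts at the Jaccard value 2/3 after the first pass and can only grow in later passes as the non-common neighbor b becomes similar to c and d.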

The space complexity of all existing measures is $O(n^2)$,
because the measures must store the scores of all pairs of papers.
Let $d_1$ and $d_2$ be the average numbers of in-links and out-links
of all papers, respectively. The time complexities of rvs-SimRank,
SimRank, and P-Rank are $O(k \cdot d_2^2 \cdot n^2)$,
$O(k \cdot d_1^2 \cdot n^2)$, and $O(k \cdot (d_1^2 + d_2^2) \cdot n^2)$,
respectively [5]. The time complexity of C-Rank is
$O(k \cdot (d_1 + d_2)^2 \cdot n^2)$, which is slightly higher than the
others. However, the worst-case time complexity of all existing
iterative measures, including C-Rank, is $O(n^4)$.

The time complexity of C-Rank may become too high. There
have been many methods to improve the time complexity of
SimRank [9][10][11][12][13]. These methods can be applied to
C-Rank, because the equation of C-Rank and that of SimRank are
similar.



4. EXPERIMENTS 

In this section, we compare the effectiveness of C-Rank and the 
existing similarity measures. 

4.1 Experimental Setup 

Our experiments ran on a scientific literature database with
papers from DBLP¹ and reference information crawled from Libra².
We used only the papers related to database research, because the
running time of the existing similarity measures and C-Rank on
a large database can become very high. We used the publication
venues listed in [19] to select papers related to database research.
Table 3 lists the publication venues in [19]. The number of
papers was 23,795 and the number of references (to the papers in the
dataset) was 126,281. All our experiments were performed on an
Intel PC with a quad-core 2.67GHz CPU, running Windows 2008.
We compared C-Rank with rvs-SimRank, SimRank, and P-Rank,
because Coupling, Co-citation, and Amsler can be expressed
using rvs-SimRank, SimRank, and P-Rank, respectively. For fairness
of comparison, we set the decay factor C = 0.8 for all measures
and the relative weight λ to 0.5 for P-Rank, unless otherwise
noted. All the default values of parameters are set in accordance 
with 0. 



Table 2: The C-Rank Algorithm

C-Rank (G, C, k)

Input:  A reference graph G (an undirected graph),
        the decay factor C, the iteration number k
Output: C-Rank score R(*, *)

 1  foreach p ∈ G do                      /* Initialization */
 2    foreach q ∈ G do
 3      if p == q then R(p, q) = 1
 4      else R(p, q) = 0
 5  while (n < k) do                      /* Iteration; n starts at 0 */
 6    foreach p ∈ G do
 7      foreach q ∈ G do
 8        R*(p, q) = |L(p) ∩ L(q)| / |L(p) ∪ L(q)|; differentSetofp = differentSetofq = 0
 9        foreach lp ∈ L(p) \ L(q)
10          foreach lq ∈ L(q)
11            differentSetofp += R(lp, lq)
12        differentSetofp ×= 1 / (|L(p) ∪ L(q)| |L(q)|)
13        foreach lp ∈ L(q) \ L(p)
14          foreach lq ∈ L(p)
15            differentSetofq += R(lp, lq)
16        differentSetofq ×= 1 / (|L(p) ∪ L(q)| |L(p)|)
17        R*(p, q) += C × (differentSetofp + differentSetofq)
18    foreach p ∈ G do                    /* Update */
19      foreach q ∈ G do
20        R(p, q) = R*(p, q)
21    n = n + 1
22  return R(*, *)



¹ http://www.informatic.uni-trier.de/~ley/db/
² http://academic.research.microsoft.com



4.2 Accurate Evaluation Method 

Previous studies on similarity measures used various evaluation
methods. [2] and [3] evaluated Coupling and Co-citation
qualitatively, showing some example cases. Although easy to use,
qualitative evaluations do not provide any concrete evidence
on which measure is better or how accurate each measure is. [8]
used a text-based similarity measure and Co-citation as ground truth
to evaluate the accuracy of SimRank. Because the text-based
similarity measure is less accurate than SimRank, and Co-citation does
not generate accurate similarity scores, at least in scientific
literature databases, using these two measures as ground truth does
not seem a good evaluation method for scientific literature databases.
[5] clustered papers using the similarity score by SimRank and the
similarity score by P-Rank, respectively, and evaluated the
accuracy of the two measures by comparing the similarity scores of
papers from the same cluster and those from different clusters.
Although used for evaluating the quality of clustering in clustering
research, this method is not suitable for evaluating a similarity
measure, because the results depend on the type of data and the
clustering algorithm [20].

Table 3: Publication venues related to database research [19]

ADBIS, ADC, ARTDB, BNCOD, CDB, CIKM, CoopIS,
DANTE, DASFAA, DAWAK, DB, DBPL, DBSEC, DEXA,
DKD, DKE, DL, DMKD, DNIS, DOLAP, DOOD, DPD,
DPDS, DS, EDBT, ER, FODO, FOIKS, FQAS, GIS, HPTS,
ICDE, ICDM, ICDT, ICIS, IDA, IDEAL, IDEAS, IGSI, Inf.
Process. Lett., Inf. Sci., Inf. Syst., IPM, IQIS, ISF,
ISR, IW-MMDBMS, IWDM, JDM, JIIS, JMIS, K-CAPKA,
KDD, KER, KIS, KR, MDA, MFDBS, MLDM, MMDB,
MSS, NLDB, OODBS, PAKDD, PKDD, PODS, RIDE, RIDS,
SIGKDD Exp., SIGMOD, SIGMOD Rec., SSD, SSDMB,
TKDE, TODS, TOIS, TSDM, UIDIS, VDB, VLDB, VLDB-J,
WebDB, WIDM, WISE, XMLEC



One of the most accurate ways to evaluate the accuracy of a
similarity measure would be to ask humans [8], but user studies are
expensive and time-consuming. We propose a new evaluation method
that achieves similar effects without employing user studies. We
ask domain experts to select the papers similar to each other, and 
evaluate each similarity measure based on the similarity score be- 
tween the selected papers. The higher the score is, the more accu- 
rate the similarity measure is. 

The evaluation process in detail is as follows. First, we select
five well-known fields in data mining (clustering, sequential
pattern mining, graph mining, spatial databases, link mining) and,
for each field, collect the references listed at the end of the
corresponding chapter of a data mining textbook [14]. The references
include both old and recent papers. Second, we use one of the
references as a query paper and find the m highest-scoring papers
(where m can be 10, 20, 30, 40, and 50) by each similarity measure.
Third, we compute the precision of each similarity measure by
comparing the m highest-scoring papers to those in the reference
list of the field of the query paper. Fourth, we repeat the second
and third steps until all references have been used as a query paper.
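
The second and third steps of this process amount to a precision-at-m computation, which can be sketched as follows. This is an assumed reading of the procedure, not the authors' evaluation code: `precision_at_m` is a hypothetical helper, the score dictionary is keyed by (query, candidate) pairs, ties are broken by paper identifier, and the precision is taken as the number of hits divided by m.

```python
def precision_at_m(scores, query, relevant, m):
    """Fraction of the m highest-scoring papers for `query` that appear in
    `relevant`, the reference list of the query paper's field."""
    candidates = [(s, q) for (p, q), s in scores.items()
                  if p == query and q != query]
    # Highest score first; ties broken by paper identifier for determinism.
    top = sorted(candidates, key=lambda t: (-t[0], t[1]))[:m]
    hits = sum(1 for _, q in top if q in relevant)
    return hits / m


# Toy similarity scores for one query paper (illustrative values only).
scores = {("q", "a"): 0.9, ("q", "b"): 0.5, ("q", "c"): 0.1}
relevant = {"a", "c"}

print(precision_at_m(scores, "q", relevant, m=2))  # 0.5
```

Averaging this value over all query papers, for each m in 10, 20, 30, 40, and 50, would give one accuracy curve per similarity measure.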

4.3 Experimental Results 

4.3.1 Normalization Method 

In this section, we compare the accuracy of similarity measures
with the Jaccard coefficient and that with the pairwise normalization
method. Figure 4 shows the accuracy of P-Rank and C-Rank with
different normalization methods. (The other measures, rvs-SimRank
and SimRank, exhibit similar results and are thus omitted.) The
accuracy of both similarity measures with the Jaccard coefficient is
higher than that with pairwise normalization. The results confirm that
the Jaccard coefficient is a more suitable normalization method for
scientific literature databases. Note that the accuracy of C-Rank with
pairwise normalization is lower than that of P-Rank with pairwise
normalization. This is because C-Rank uses more links than
P-Rank, as discussed in Section 3.3.

4.3.2 Top 10 Rankings 

In this section, we confirm that C-Rank measures similarity prop- 
erly by extracting the top 10 highest-scoring papers by C-Rank 
when paired with a well-known paper as a query paper. We use 
[21] and [22], two well-known papers in the database and data mining 
research fields, respectively. Table 4 lists the top 10 highest-scoring 
papers when paired with [21], and Table 5 lists the top 10 
highest-scoring papers when paired with [22]. [21] proposed the R-Tree as 
a multidimensional index. In Table 4, the highest-scoring papers 
by C-Rank are mostly related to multidimensional indexes. [22] 
proposed BIRCH as a clustering method. In Table 5, the papers 
by C-Rank are mostly related to clustering. The results show that 
C-Rank can provide a set of papers similar to the paper under con- 
sideration. 

[Figure 4 plots: accuracy versus m (10, 20, 30, 40, 50) with pairwise and with Jaccard normalization; panel (a) P-Rank, panel (b) C-Rank.] 

Figure 4: Comparing Jaccard coefficient and pairwise normalization method. 

4.3.3 Failure of Existing Similarity Measures 

In this section, we demonstrate the problem of existing similarity 
measures when applied to scientific literature databases using the 
three cases identified in Section 2.2. We also show that C-Rank 
computes the similarity score properly in all three cases. For 
demonstration purposes, we select [23], [24] and [21], [25] as the pairs 
of old but similar papers, [26], [27] and [28], [29] as the pairs of 
recent but similar papers, and [25], [30] and [31], [32] as the pairs 
of an old and a recent paper. 

Table 6 shows the result of the case analysis. Six cases are 
illustrated in Table 6, but all other examples tested show similar 
results. In Table 6, the similarity scores between old but similar 
papers by rvs-SimRank are 0 in both cases. As noted in Section 2.2, 
rvs-SimRank incorrectly identifies the papers as not similar 
because they have no common out-links. Similarly, the similarity 
scores between recent but similar papers by SimRank are 0 in 
both cases. SimRank incorrectly identifies the papers as not 
similar because they have no common in-links. Furthermore, all 
existing similarity measures compute the similarity scores between 
papers with different publication dates as 0. C-Rank is the only 
one that measures the similarity of those papers. That is, C-Rank 
is able to capture the similarity between papers with different 
publication dates. Note that the scores by C-Rank are not high in 
both cases. This is because the problem tackled in the old paper 
and that in the newer paper, although somewhat similar, have less 
in common as time passes. The original problem may have become 
a more specific problem, or it may have become a more general 
problem, etc. 
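
The three failure cases can be reproduced on a toy citation graph. The sketch below is illustrative only: it runs one generic SimRank-style iteration with pairwise normalization and swaps in different neighbor sets (in-links, out-links, or both directions). The undirected variant is a simplified stand-in for C-Rank and is not Equation (8); the graph, the paper names, and the function name are hypothetical.

```python
from itertools import product


def iterative_similarity(neighbors, decay=0.8, iterations=5):
    """SimRank-style iteration over the given neighbor lists (pairwise normalization)."""
    nodes = list(neighbors)
    score = {(a, b): float(a == b) for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            na, nb = neighbors[a], neighbors[b]
            if a == b:
                new[(a, b)] = 1.0
            elif not na or not nb:
                new[(a, b)] = 0.0  # no links to compare, so no evidence of similarity
            else:
                total = sum(score[(x, y)] for x in na for y in nb)
                new[(a, b)] = decay * total / (len(na) * len(nb))
        score = new
    return score


# Hypothetical graph: recent papers r1 and r2 cite m; m cites old papers o1 and o2.
# The references of o1 and o2 are missing, and nobody cites r1 or r2 yet.
edges = [("r1", "m"), ("r2", "m"), ("m", "o1"), ("m", "o2")]
papers = ["r1", "r2", "m", "o1", "o2"]
in_links = {p: [src for src, dst in edges if dst == p] for p in papers}
out_links = {p: [dst for src, dst in edges if src == p] for p in papers}
both = {p: in_links[p] + out_links[p] for p in papers}

simrank = iterative_similarity(in_links)       # in-links only
rvs_simrank = iterative_similarity(out_links)  # out-links only
undirected = iterative_similarity(both)        # direction of references ignored

for pair in [("o1", "o2"), ("r1", "r2"), ("o1", "r1")]:
    print(pair, simrank[pair], rvs_simrank[pair], undirected[pair])
```

On this graph the in-link variant scores the recent pair as 0, the out-link variant scores the old pair as 0, both score the old-recent pair as 0, and only the undirected variant gives all three pairs a nonzero score.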



Table 4: Top 10 papers similar to [21] 

Rank  Title 
1     The R*-Tree: An Efficient and Robust Access Method ... 
2     The R+-Tree: A Dynamic Index for Multi-Dimensional ... 
3     Nearest Neighbor Queries 
4     The K-D-B-Tree: A Search Structure For Large ... 
5     The X-tree: An Index Structure for ... 
6     On Packing R-trees 
7     The Grid File: An Adaptable, Symmetric Multikey ... 
8     Efficient Processing of Spatial Joins Using R-Trees 
9     Hilbert R-tree: An Improved R-tree using Fractals 
10    The SR-tree: An Index Structure for High-Dimensional ... 


Table 5: Top 10 papers similar to [22] 

Rank  Title 
1     Efficient and Effective Clustering Methods ... 
2     CURE: An Efficient Clustering Algorithm ... 
3     A Density-Based Algorithm for Discovering Clusters ... 
4     Automatic Subspace Clustering of High Dimensional ... 
5     Scaling Clustering Algorithms to Large Databases 
6     WaveCluster: A Multi-Resolution Clustering Approach ... 
7     Fast Algorithms for Projected Clustering 
8     STING: A Statistical Information Grid Approach ... 
9     An Efficient Approach to Clustering in Large ... 
10    OPTICS: Ordering Points To Identify the Clustering ... 



Table 6: The results of case analysis 

Measure      Pair  Old papers      Recent papers   An old and a recent paper 
             (1)   [23] and [24]   [26] and [27]   [25] and [30] 
             (2)   [21] and [25]   [28] and [29]   [31] and [32] 

rvs-SimRank  (1)   0               0.278           0 
             (2)   0               0.189           0 

SimRank      (1)   0.179           0               0 
             (2)   0.141           0               0 

P-Rank       (1)   0.114           0.198           0 
             (2)   0.082           0.096           0 

C-Rank       (1)   0.240           0.282           0.050 
             (2)   0.175           0.210           0.047 



4.3.5 Distribution of Similarity Scores 

In this section, we count the number of pairs whose similarity is 
computable by each similarity measure. Figure 6 shows the 
distribution of the similarity scores by each similarity measure. In 
Figure 6, the x-axis represents the range of similarity scores, where 
[lb, ub) indicates that lb is included in the range and ub is not, and 
the y-axis represents the number of pairs of papers. The y-axis is in 
log scale because, for most pairs, the similarity scores are either 
N/A or in [0, 0.1). N/A represents the pairs whose similarity cannot 
be measured. As shown in Figure 6, there are no pairs of papers 
whose similarity scores are N/A by C-Rank. This implies that 
C-Rank computes the similarity score between all pairs of papers, 
because C-Rank uses both in-links and out-links simultaneously. One 
might expect the pairs of papers whose similarity scores are N/A by 
the other measures to be scored as near 0 by C-Rank. 
However, we note that the number of pairs in [0, 0.1) by C-Rank 
does not differ much from those of the other measures. This result 
indicates that C-Rank provides meaningful similarity scores for 
pairs of papers even when their computation is infeasible with the 
other similarity measures. 
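
The binning behind such a distribution can be sketched as follows. This is only an illustration: the bin boundaries follow the [lb, ub) ranges described above, while the function names, the use of `None` to mark a score that cannot be measured, and the sample scores are our assumptions.

```python
from collections import Counter

# Half-open ranges [lb, ub): lb is included, ub is not.
BINS = [(0.0, 0.1), (0.1, 0.2), (0.2, 0.3), (0.3, 0.4), (0.4, 0.5), (0.5, 1.0)]


def bin_label(score):
    """Return the label of the range containing `score`, or "N/A" if unmeasurable."""
    if score is None:
        return "N/A"
    for lb, ub in BINS:
        if lb <= score < ub:
            return f"[{lb}, {ub})"
    return "other"  # e.g. a score of exactly 1.0


def score_distribution(scores):
    """Count how many pair scores fall into each range."""
    return Counter(bin_label(s) for s in scores)


# Hypothetical scores for seven pairs of papers.
dist = score_distribution([None, 0.0, 0.05, 0.1, 0.25, 0.5, 0.99])
print(dist)
```

Plotting these counts on a log-scale y-axis keeps the sparse high-score ranges visible next to the dominant N/A and [0, 0.1) ranges.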

[Figure 5 plot: accuracy of each similarity measure versus m.] 

Figure 5: The accuracy of the similarity measures. 



4.3.4 Accuracy of Similarity Measures 

Figure 5 shows the accuracy of the different similarity measures. 
In Figure 5, the x-axis represents m, the number of top-scoring 
papers, and the y-axis represents the accuracy of each similarity 
measure. As shown in Figure 5, the accuracy of C-Rank is higher than 
that of the other similarity measures regardless of the value of m. 
The results indicate that C-Rank is more accurate than the other 
measures in scientific literature databases. 



4.3.6 Similarity Scores with Variations of the Number of Iterations 

In this section, we examine the algorithmic nature of similarity 
measures by tracing the changes in the similarity score while 
varying k. Figure 7 shows the average of the similarity scores of 
the 10 highest-scoring pairs of papers while varying k from 1 to 10. 
In Figure 7, the x-axis represents the number of iterations, and the 
y-axis represents the average of the scores of the top 10 highest-scoring 
[Figure 6 plot: number of pairs of papers (log scale) per similarity score range: N/A, [0, 0.1), [0.1, 0.2), [0.2, 0.3), [0.3, 0.4), [0.4, 0.5), [0.5, 1.0).] 

Figure 6: Distributions of the similarity scores. 



4.3.8 Accuracy of Similarity Measures with Variations of the Relative Weight 

So far, we have set the relative weight λ in P-Rank to 0.5. 
In this section, we compare the accuracy of C-Rank with that of 
P-Rank under different values of λ, which is set to 0.3, 0.5, and 0.8. 
Figure 9 shows the accuracy of C-Rank and P-Rank with variations 
of λ. In Figure 9, the x-axis represents m, the number of 
top-scoring papers, and the y-axis represents the accuracy of each 
similarity measure. The accuracy of C-Rank is higher than that of 
P-Rank regardless of the value of λ in most cases. Although the 
accuracy of P-Rank with λ = 0.8 is higher than that of C-Rank in 
two cases, m = 40 and m = 50, accuracy is more important when m is 
small, especially in scientific literature retrieval services, and 
C-Rank achieves a higher accuracy than P-Rank when m is 10, 20, 
and 30. 



pairs of papers by rvs-SimRank, SimRank, P-Rank, and C-Rank, 
respectively. The similarity score $R_k(*, *)$ becomes more accurate 
on successive iterations. Iteration 2, which computes $R_2(*, *)$ 
from $R_1(*, *)$, can be thought of as the first iteration taking 
advantage of the recursive power of algorithms for similarity 
computation. Subsequent changes become increasingly minor, suggesting 
a rapid convergence. The score by SimRank converges at k = 3, 
the score by rvs-SimRank converges at k = 5, the score by P-Rank 
converges at k = 6, and the score by C-Rank converges at k = 9. 
Because it utilizes the highest number of links, C-Rank is the last 
one to converge. 

[Figure 7 plot: average similarity score versus the number of iterations k.] 

Figure 7: The similarity scores with different k values. 



4.3.7 Similarity Scores with Variations of Decay Factor 

In this section, we show how the decay factor C is related to the 
speed of convergence in C-Rank. Figure 8 shows the average 
similarity scores by C-Rank with variations of C. In Figure 8, the 
x-axis represents the number of iterations, and the y-axis represents 
the average similarity score of the top 10 highest-scoring pairs of 
papers. The decay factor C is set to 0.2, 0.5, and 0.8, respectively. 
The similarity score of C-Rank increases as C increases. 
When C = 0.2, C-Rank converges quickly, at k = 2. 
When C = 0.8, on the other hand, C-Rank converges at the ninth 
iteration. When C is low, the recursive power of C-Rank is 
weakened such that only the papers in the local or near-local 
neighborhood are used in similarity computation. When C is high, more 
papers in a more global neighborhood can be used in computing the 
similarity recursively. When C is high, therefore, the convergence 
takes more time. 
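
The relationship between the decay factor and the speed of convergence can be illustrated on a toy graph. The sketch below is not the C-Rank algorithm of Equation (8): it uses a SimRank-style iteration with pairwise normalization over undirected neighbors, and the tolerance-based stopping rule, the triangle graph, and the function name are our assumptions.

```python
from itertools import product


def iterations_to_converge(neighbors, decay, tol=1e-4, max_iter=100):
    """Iterate until no pair score changes by `tol` or more.

    Returns the iteration count and the final scores.
    """
    nodes = list(neighbors)
    score = {(a, b): float(a == b) for a, b in product(nodes, repeat=2)}
    for k in range(1, max_iter + 1):
        new = {}
        for a, b in product(nodes, repeat=2):
            na, nb = neighbors[a], neighbors[b]
            if a == b:
                new[(a, b)] = 1.0
            elif not na or not nb:
                new[(a, b)] = 0.0
            else:
                total = sum(score[(x, y)] for x in na for y in nb)
                new[(a, b)] = decay * total / (len(na) * len(nb))
        change = max(abs(new[pair] - score[pair]) for pair in new)
        score = new
        if change < tol:
            return k, score
    return max_iter, score


# Three papers linked to each other, with the direction of references ignored.
triangle = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}

k_low, scores_low = iterations_to_converge(triangle, decay=0.2)
k_high, scores_high = iterations_to_converge(triangle, decay=0.8)
print(k_low, scores_low[("a", "b")])
print(k_high, scores_high[("a", "b")])
```

On this graph each update multiplies the remaining error by 3C/4, so a larger decay factor both raises the converged score and needs more iterations to reach it, which matches the trend described above.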

Figure 8: The similarity scores with different C values. 

Figure 9: The accuracy of C-Rank and P-Rank with different λ values. 



5. CONCLUSIONS 

In this paper, we propose C-Rank, a new similarity measure for 
scientific literature databases. We examine two notable 
characteristics of scientific literature databases and identify three 
cases where all existing similarity measures fail to compute the 
similarity score correctly. Our observations lead to the development 
of C-Rank, which uses both in-links and out-links while disregarding 
the direction of references. In addition, we verify that the Jaccard 
coefficient is more appropriate for scientific literature databases, 
and propose an evaluation method for measuring the accuracy of 
similarity measures. For experiments, we have built a database with 
real papers from DBLP and reference information crawled from Libra. 
Experimental results show that C-Rank achieves higher accuracy than 
existing similarity measures in most cases. 

The contributions of this paper are as follows: 

1. We have pointed out that existing similarity measures fail to 
compute the similarity score properly for scientific papers. 

2. We have proposed C-Rank, a new similarity measure for computing 
the similarity score among papers. 

3. We have proposed a normalization method suitable for sci- 
entific literature databases. 

4. We have proposed a quantitative evaluation method which 
matches the intuition of users. 

7. APPENDIX 

We prove the following four mathematical properties: 



The above equation represents the following: 



1. (Symmetry) According to Equation (8), $R_k(a, b) = R_k(b, a)$ 
for $k \ge 0$. 

2. (Monotonicity) If $a = b$, $R_0(a, b) = R_1(a, b) = \dots = 1$, so 
the monotonicity property holds. We consider $a \ne b$. 
According to Equation (8), $R_0(a, b) = 0$. Based on Equation (8), 
$0 \le R_1(a, b) \le 1$. So, $0 \le R_0(a, b) \le R_1(a, b) \le 1$. We 
assume that for all $k$, $0 \le R_{k-1}(a, b) \le R_k(a, b) \le 1$, then 

$R_{k+1}(a, b) - R_k(a, b) = C \times \cdots$ 

Based on the assumption, we have $(R_k(a, b) - R_{k-1}(a, b)) \ge 0$, 
$\forall a, b \in G$, so the left-hand side $R_{k+1}(a, b) - R_k(a, b) \ge 0$ 
holds. By induction, we draw the conclusion that for any $k$, 
$R_k \le R_{k+1}$. And based on the assumption, $0 \le R_k(a, b) \le 1$, 
so 

so, $R_{k+1}(a, b) \le C < 1$. By induction, we know that for any 
$k$, $0 \le R_k(a, b) \le 1$. 

3. (Existence) According to (Monotonicity), $\forall a, b \in G$, $R_k(a, b)$ 
is bounded and nondecreasing as $k$ increases. By the 
Completeness Axiom of calculus, each sequence $R_k(a, b)$ 
converges to a limit $R(a, b) \in [0, 1]$. Note that 
$\lim_{k \to \infty} R_k(a, b) = \lim_{k \to \infty} R_{k+1}(a, b) = R(a, b)$. 
So we have 

Note that the limit of $R_k(*, *)$, with respect to $k$, exactly satisfies 
the recursive C-Rank equation shown in Equation (8). 

4. (Uniqueness) Suppose $s_1(*, *)$ and $s_2(*, *)$ are two solutions to 
the iterative C-Rank equations. For any entities $x, y \in G$, 
let $\delta(x, y) = s_1(x, y) - s_2(x, y)$ be their difference. Let 
$M = \max_{x, y} |\delta(x, y)|$ be the maximum absolute value of any 
difference. We need to show that $M = 0$. Let $|\delta(a, b)| = M$ for 
some $a, b \in G$. It is obvious that $M = 0$ if $a = b$. Otherwise, 

Thus, 